standard error of the equating.

In item response theory (IRT), an examinee's expected number-right
score $\xi$ on test X is equal to the test characteristic function
evaluated at the examinee's ability level $\theta$:

$$\xi = \sum_{g=1}^{n_x} P_g(\theta) \tag{1'}$$

where $P_g(\theta)$ is the item response function, the probability of a
correct answer to item $g$ at ability level $\theta$. If we have a second
test, Y, measuring the same ability as X, the expected number-right score
$\eta$ on this test may be written as

$$\eta = \sum_{h=1}^{n_y} P_h(\theta). \tag{4'}$$

Here $n_x$ and $n_y$ denote the numbers of items in tests X and Y.

Equations (1') and (4') are parametric equations for the functional
relationship between $\xi$ and $\eta$. Note that this relationship is an
exact mathematical one, not a statistical association. Given any $\theta$,
(1') and (4') determine a pair of values, $\xi$ and $\eta$, that represent
the same ability level $\theta$. Pairs of values $(\xi, \eta)$ determined
in this way are equated. In practice, it is often assumed that the
functional relationship of $\eta$ to $\xi$ given by (1') and (4') can also
be applied to actual number-right scores on the two tests, producing
an equating of these scores.
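The parametric relationship in (1') and (4') can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-parameter logistic form with scaling constant D = 1.7 is assumed for the item response function, the item parameters are hypothetical, and the helper names (`p3pl`, `tcc`, `theta_for_score`, `equate`) are invented for this sketch.

```python
import math

D = 1.7  # logistic scaling constant (an assumption of this sketch)

def p3pl(theta, a, b, c):
    """Three-parameter logistic item response function (assumed form)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def tcc(theta, items):
    """Test characteristic function: expected number-right score at theta."""
    return sum(p3pl(theta, a, b, c) for a, b, c in items)

def theta_for_score(score, items, lo=-8.0, hi=8.0):
    """Invert the increasing test characteristic function by bisection."""
    if not tcc(lo, items) < score < tcc(hi, items):
        raise ValueError("score is not attainable on the search interval")
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if tcc(mid, items) < score:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def equate(xi, items_x, items_y):
    """Eliminate theta from (1') and (4'): map xi on test X to eta on test Y."""
    theta = theta_for_score(xi, items_x)
    return tcc(theta, items_y)

# Hypothetical (a, b, c) item parameters, all on one common ability scale.
test_x = [(1.0, -1.0, 0.2), (0.8, 0.0, 0.2), (1.2, 1.0, 0.2)]
test_y = [(0.9, -0.5, 0.2), (1.1, 0.5, 0.2)]

eta = equate(2.0, test_x, test_y)
print(f"xi = 2.0 on X corresponds to eta = {eta:.3f} on Y")
```

Bisection is used here only because the test characteristic function is monotone in ability when all discriminations are positive; any root finder would serve.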

*This work was supported in part by contract N00014-80-C-0402,
project designation NR 150-453 between the Office of Naval Research and
Educational Testing Service. Reproduction in whole or in part is permitted
for any purpose of the United States Government.


Here, we simply deal with the sampling errors in estimating the
equating relationship of $\eta$ to $\xi$. In (1') and (4'), estimated
item parameters must be used. These are the source of the sampling
errors in IRT equating. Note that the ability estimates for individual
examinees are not used in (1') and (4') and thus will not appear in
our formulas. Until now, the sampling errors of IRT equatings have
never been estimated.


Data 

In IRT equating, we frequently have a set of common items that are
administered to all examinees. These are needed in order to get
Test Y item parameters on the same scale as Test X item parameters.
If the common items are external to tests X and Y, as assumed here,
the common items are called the anchor test, or, in the present report,
Test W. The sampling variance formulas to be obtained here can be
modified in obvious ways for the case where some or all of the common
items are internal to the tests that are being equated.

Designate the examinees who took both Tests X and W as
Group 1; designate the examinees who took Tests Y and W as Group 2.
Typically, every examinee falls in one of these two groups.

In practice, when there is a series of test forms A, B, ..., X, Y, Z, ...
(say), the 'Group 1' data on Test X are processed as soon as they
become available in order to equate Test X to the preceding form.
When the Group 2 data become available at some later date, it is
often considered uneconomical to rerun the Group 1 data, so Group 2 is
run by itself. This case, where item parameters for Groups 1 and 2
are estimated separately, is the case to be considered here. (The
simplifying assumption that is used below to approximate the sampling
variances of the estimated item parameters is not available in the
alternative case where Groups 1 and 2 are pooled and all parameters
estimated simultaneously.)


New Equating Formulas

When parameters are estimated separately for Groups 1 and 2,
the item parameters and $\theta$ in (4') have a different origin and scale
from the item parameters and $\theta$ in (1'). It is thus no longer
possible simply to eliminate $\theta$ from (1') and (4') to obtain the
relation of $\eta$ to $\xi$. The customary procedure in this situation is
to use the anchor test to transform the Group 2 item parameters onto the
scale of the Group 1 item parameters. This procedure adds to the sampling
variance of the transformed item parameters and greatly complicates any
determination of the sampling variance of the subsequent equating. The
procedures and formulas given below avoid this problem since they avoid
any transformation of item parameters.

Equations (1') and (4') remain unchanged except that additional
subscripts (explained below) are used. In particular, the symbols
$\theta_1$ and $\theta_2$ must be distinguished because Groups 1 and 2 use
different ability scales:

$$\xi = \sum_g P_{g1}(\theta_1), \tag{1}$$

$$\eta = \sum_g P_{g4}(\theta_2). \tag{4}$$

The item response functions here are written $P_{gp}$, where
$p = 1, 2, 3, 4$ refers to (test X, group 1), (test W, group 1),
(test W, group 2), and (test Y, group 2) respectively, and
$g = 1, 2, \ldots, n_p$, where $n_p$ is the number of items in the
appropriate test.

Let us write down similar equations for the expected number-right
score $\omega$ on anchor test W:

$$\omega = \sum_g P_{g2}(\theta_1), \tag{2}$$

$$\omega = \sum_g P_{g3}(\theta_2). \tag{3}$$

The equation numbering keeps the tests in convenient order. The desired
equating relation between $\eta$ and $\xi$ can be obtained by eliminating
$\theta_1$, $\theta_2$, and $\omega$ from these four equations.

Computer programs are available for equating $\eta$ to $\xi$ by
eliminating $\theta$ from (1') and (4'). These same programs can be used
to equate $\omega$ to $\xi$ in one step, using (1) and (2), then to equate
$\eta$ to $\omega$ in a second step using (3) and (4). This produces an
equating of $\eta$ to $\xi$ for the presently relevant situation where
Group 1 and Group 2 parameters are not on the same scale.
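The two-step procedure through the anchor test can be sketched as follows. This is an illustration, not the programs referred to in the text: the three-parameter logistic form with D = 1.7, the hypothetical item parameters, and the helper names (`equate_via_anchor`, `rescale`) are all assumptions. The example places the Group 2 parameters on a linearly different ability scale and checks that the equated score is unaffected, which is the property that makes a transformation of item parameters unnecessary.

```python
import math

D = 1.7  # logistic scaling constant (an assumption of this sketch)

def p3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def tcc(theta, items):
    """Expected number-right score at ability theta."""
    return sum(p3pl(theta, a, b, c) for a, b, c in items)

def solve_theta(score, items, lo=-20.0, hi=20.0):
    """Bisection on the increasing test characteristic function."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if tcc(mid, items) < score:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def equate_via_anchor(xi, x1, w1, w2, y2):
    """Eliminate theta_1, theta_2 and omega from equations (1)-(4)."""
    theta1 = solve_theta(xi, x1)      # invert (1) on the Group 1 scale
    omega = tcc(theta1, w1)           # (2): anchor score, Group 1 parameters
    theta2 = solve_theta(omega, w2)   # invert (3) on the Group 2 scale
    return tcc(theta2, y2)            # (4): score on test Y

def rescale(items, A, B):
    """Re-express (a, b, c) on the ability scale theta* = A * theta + B."""
    return [(a / A, A * b + B, c) for a, b, c in items]

# Hypothetical parameters, first written on the Group 1 scale.
test_x = [(1.0, -1.0, 0.2), (0.8, 0.0, 0.2), (1.2, 1.0, 0.2)]
anchor_w = [(0.9, -0.5, 0.2), (1.1, 0.5, 0.2)]
test_y = [(0.8, -0.8, 0.2), (0.9, 0.2, 0.2), (0.7, 1.1, 0.2)]

# Group 2 calibration on a different origin and scale (A, B are arbitrary).
w2 = rescale(anchor_w, 1.3, -0.4)
y2 = rescale(test_y, 1.3, -0.4)

via_anchor = equate_via_anchor(2.0, test_x, anchor_w, w2, y2)
direct = tcc(solve_theta(2.0, test_x), test_y)
print(f"via anchor: {via_anchor:.6f}   common scale: {direct:.6f}")
```

With error-free parameters the two results agree; with estimated parameters the anchor-test parameters in the two groups differ by sampling error, which is what the variance formula derived below accounts for.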

An estimated equating is obtained from (1)-(4) after replacing
the true item parameters by their maximum likelihood estimates. Using
carets to denote this change, we have

$$\xi = \sum_g \hat{P}_{g1}(\hat\theta_1), \tag{1"}$$

$$\hat\omega = \sum_g \hat{P}_{g2}(\hat\theta_1), \tag{2"}$$

$$\hat\omega = \sum_g \hat{P}_{g3}(\hat\theta_2), \tag{3"}$$

$$\hat\eta = \sum_g \hat{P}_{g4}(\hat\theta_2). \tag{4"}$$

These equations show that $\hat\eta$ is a function of all the estimated
item parameters together with the specified value of $\xi$.

Derivatives 


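For item $g$, instead of using $a_g$, $b_g$, and $c_g$ to denote
the three parameters commonly used in IRT, let us use $t_{1gp}$,
$t_{2gp}$, and $t_{3gp}$, respectively. We will need certain derivatives
for $r = 1, 2, 3$, obtained from (1")-(4") (carets are omitted in the
derivatives to keep the notation readable):

$$\frac{\partial\eta}{\partial t_{rg4}} = P^{(r)}_{g4}(\theta_2), \tag{5}$$

$$\frac{\partial\omega}{\partial t_{rg3}} = P^{(r)}_{g3}(\theta_2), \qquad
\frac{\partial\omega}{\partial t_{rg2}} = P^{(r)}_{g2}(\theta_1),$$

where $P^{(r)}_{gp}$ denotes the derivative of $P_{gp}$ with respect to
$t_{rgp}$. Similarly,

$$\frac{\partial\eta}{\partial\theta_2} = \sum_h P'_{h4}(\theta_2), \qquad
\frac{\partial\omega}{\partial\theta_1} = \sum_h P'_{h2}(\theta_1),$$

where $P'$ denotes a derivative with respect to $\theta$. Using the formula
for the derivative of an implicit function, we also find from (1")-(4"),
for $r = 1, 2, 3$,

$$\frac{\partial\theta_2}{\partial t_{rg3}} = -\frac{P^{(r)}_{g3}(\theta_2)}{\sum_h P'_{h3}(\theta_2)}, \qquad
\frac{\partial\theta_1}{\partial t_{rg1}} = -\frac{P^{(r)}_{g1}(\theta_1)}{\sum_h P'_{h1}(\theta_1)}, \qquad
\frac{\partial\theta_2}{\partial\omega} = \frac{1}{\sum_h P'_{h3}(\theta_2)}.$$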


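Using the chain rule for derivatives, we find from the above
formulas:

$$\frac{\partial\eta}{\partial t_{rg3}}
= \frac{\partial\eta}{\partial\theta_2}\,\frac{\partial\theta_2}{\partial t_{rg3}}
= -P^{(r)}_{g3}(\theta_2)\,\frac{\sum_h P'_{h4}(\theta_2)}{\sum_h P'_{h3}(\theta_2)}, \tag{6}$$

$$\frac{\partial\eta}{\partial t_{rg2}}
= \frac{\partial\eta}{\partial\theta_2}\,\frac{\partial\theta_2}{\partial\omega}\,\frac{\partial\omega}{\partial t_{rg2}}
= P^{(r)}_{g2}(\theta_1)\,\frac{\sum_h P'_{h4}(\theta_2)}{\sum_h P'_{h3}(\theta_2)}, \tag{7}$$

$$\frac{\partial\eta}{\partial t_{rg1}}
= \frac{\partial\eta}{\partial\theta_2}\,\frac{\partial\theta_2}{\partial\omega}\,\frac{\partial\omega}{\partial\theta_1}\,\frac{\partial\theta_1}{\partial t_{rg1}}
= -P^{(r)}_{g1}(\theta_1)\,\frac{\sum_h P'_{h2}(\theta_1)}{\sum_h P'_{h1}(\theta_1)}\cdot\frac{\sum_h P'_{h4}(\theta_2)}{\sum_h P'_{h3}(\theta_2)}. \tag{8}$$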
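The chain-rule results (5)-(8) can be checked numerically against finite differences. The sketch below is illustrative only: it assumes the three-parameter logistic form with D = 1.7, uses hypothetical item parameters, and all function names are invented here. Tests are indexed 0 to 3 in the code, corresponding to p = 1 to 4 in the text, and parameters 0 to 2 correspond to r = 1 to 3.

```python
import math

D = 1.7  # logistic scaling constant (an assumption of this sketch)

def logistic(theta, a, b):
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def prob(theta, item):
    a, b, c = item
    return c + (1.0 - c) * logistic(theta, a, b)

def dprob_dparams(theta, item):
    """(dP/da, dP/db, dP/dc): the P^(r) of the text for r = 1, 2, 3."""
    a, b, c = item
    L = logistic(theta, a, b)
    w = D * (1.0 - c) * L * (1.0 - L)
    return (w * (theta - b), -w * a, 1.0 - L)

def dprob_dtheta(theta, item):
    a, b, c = item
    L = logistic(theta, a, b)
    return D * a * (1.0 - c) * L * (1.0 - L)

def tcc(theta, items):
    return sum(prob(theta, it) for it in items)

def slope(theta, items):
    """Sum of P'(theta) over the items of one test."""
    return sum(dprob_dtheta(theta, it) for it in items)

def solve_theta(score, items, lo=-20.0, hi=20.0):
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if tcc(mid, items) < score:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def equate(xi, tests):
    """tests = [X in group 1, W in group 1, W in group 2, Y in group 2]."""
    x1, w1, w2, y2 = tests
    theta1 = solve_theta(xi, x1)
    theta2 = solve_theta(tcc(theta1, w1), w2)
    return tcc(theta2, y2), theta1, theta2

def eta_gradient(xi, tests):
    """Derivatives of eta from (8), (7), (6), (5); result[p][g][r]."""
    x1, w1, w2, y2 = tests
    _, th1, th2 = equate(xi, tests)
    k1 = slope(th1, w1) / slope(th1, x1)   # sum P'_2 / sum P'_1 at theta_1
    k2 = slope(th2, y2) / slope(th2, w2)   # sum P'_4 / sum P'_3 at theta_2
    plan = [(-k1 * k2, th1), (k2, th1), (-k2, th2), (1.0, th2)]
    return [[[f * d for d in dprob_dparams(th, it)] for it in items]
            for (f, th), items in zip(plan, tests)]

def finite_difference(xi, tests, p, g, r, h=1e-6):
    """Central difference of eta with respect to one item parameter."""
    def bumped(delta):
        t = [[list(it) for it in items] for items in tests]
        t[p][g][r] += delta
        return equate(xi, t)[0]
    return (bumped(h) - bumped(-h)) / (2.0 * h)

# Hypothetical separately calibrated parameters (a, b, c).
tests = [
    [(1.0, -1.0, 0.2), (0.8, 0.0, 0.2), (1.2, 1.0, 0.2)],  # X, group 1
    [(0.9, -0.5, 0.2), (1.1, 0.5, 0.2)],                   # W, group 1
    [(0.7, -1.0, 0.2), (0.85, 0.3, 0.2)],                  # W, group 2
    [(0.8, -0.8, 0.2), (0.9, 0.2, 0.2), (0.7, 1.1, 0.2)],  # Y, group 2
]
grad = eta_gradient(2.0, tests)
worst = max(abs(grad[p][g][r] - finite_difference(2.0, tests, p, g, r))
            for p in range(4) for g in range(len(tests[p])) for r in range(3))
print(f"largest analytic vs. numeric discrepancy: {worst:.2e}")
```

Agreement between the analytic and numeric derivatives is a useful safeguard, since the signs in (6) and (8) are easy to get wrong.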


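Given $\xi$, we are now in a position to express $\hat\eta$ as a series in
$\hat{t}_{rgp} - t_{rgp}$ ($r = 1, 2, 3$; $g = 1, 2, \ldots, n_p$;
$p = 1, 2, 3, 4$). We will write $\eta'_{rgp}$ instead of
$\partial\eta/\partial t_{rgp}$ and $\eta''_{rgpshq}$ instead of
$\partial^2\eta/\partial t_{rgp}\,\partial t_{shq}$:

$$\hat\eta = \eta + \sum_p\sum_g\sum_r (\hat{t}_{rgp} - t_{rgp})\,\eta'_{rgp}
+ \frac{1}{2}\sum_p\sum_q\sum_g\sum_h\sum_r\sum_s
(\hat{t}_{rgp} - t_{rgp})(\hat{t}_{shq} - t_{shq})\,\eta''_{rgpshq} + \cdots. \tag{9}$$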


Sampling Variance

Transposing, squaring, and taking expectations, we find from
(9) for fixed $\xi$,

$$\operatorname{Var}\hat\eta \equiv E(\hat\eta - \eta)^2
= \sum_p\sum_q\sum_g\sum_h\sum_r\sum_s
\eta'_{rgp}\,\eta'_{shq}\,\operatorname{Cov}(\hat{t}_{rgp}, \hat{t}_{shq}) + \cdots.$$

When item parameters and abilities are both estimated simultaneously
by maximum likelihood, it is not practical to use the usual sampling
covariance formulas for all estimators simultaneously. As a rough
approximation, it is customary (Lord, 1980, Section 12.3) to use instead
the (simpler) formulas for the case where the ability parameters are known.
We will use this rough approximation here to find
$\operatorname{Cov}(\hat{t}_{rgp}, \hat{t}_{shq})$. Because of this
approximation, our sampling variance of equating will be an underestimate.

In this case, all covariances involving two different items are
exactly zero, as are all covariances involving a single item administered
to two different groups of examinees. All nonzero variances and
covariances are inversely proportional to $N$, the number of examinees.

We now have

$$\operatorname{Var}\hat\eta = \sum_p\sum_g\Big[\sum_{r=1}^{3}\sum_{s=1}^{3}
\big\{\eta'_{rgp}\,\eta'_{sgp}\,\operatorname{Cov}(\hat{t}_{rgp}, \hat{t}_{sgp})\big\}
+ \text{(triple-summation terms)} + \cdots\Big].$$

Some higher order terms are indicated here in order to make clear that
the number of terms under summation signs does not increase too rapidly.
The triple summation represents 3 times as many terms as the double
summation, but each term in the triple summation is divided by $N^{3/2}$,
whereas each term in the double summation is only divided by $N$. When
$N$ is several thousand, it is reasonable to expect that the higher
order terms can be neglected, as is customary with asymptotic variances.




Our final asymptotic formula, then, is

$$\operatorname{Var}\hat\eta = \sum_{p=1}^{4}\sum_{g=1}^{n_p}\sum_{r=1}^{3}\sum_{s=1}^{3}
\eta'_{rgp}\,\eta'_{sgp}\,\operatorname{Cov}(\hat{t}_{rgp}, \hat{t}_{sgp}). \tag{10}$$

The $\eta'$ values required here are computed from (5)-(8). The
covariances are obtained by the usual formulas for covariances of maximum
likelihood estimators of item parameters when ability parameters are
fixed (Lord, 1980, p. 191).
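Formula (10) is a sum of small quadratic forms, one per item per test, because covariances between different items and between groups vanish. A minimal sketch of that computation follows; the function name, the nested-list layout, and the numbers are all hypothetical, and the per-item 3 x 3 covariance matrices are taken as given rather than computed from the formulas cited above.

```python
import math

def equating_variance(grad, cov):
    """Equation (10): sum over tests p and items g of d' C d, where d holds
    the three derivatives eta'_{rgp} and C is the item's 3 x 3 covariance
    matrix of parameter estimates."""
    total = 0.0
    for grad_p, cov_p in zip(grad, cov):       # tests p = 1..4
        for d, c in zip(grad_p, cov_p):        # items g = 1..n_p
            total += sum(d[r] * d[s] * c[r][s]
                         for r in range(3) for s in range(3))
    return total

# Hypothetical input: a single test with two items, for brevity.
grad = [[(1.0, 0.0, 0.0), (0.5, -0.5, 0.0)]]
cov = [[
    [[0.04, 0.00, 0.00], [0.00, 0.01, 0.00], [0.00, 0.00, 0.01]],
    [[0.04, 0.02, 0.00], [0.02, 0.04, 0.00], [0.00, 0.00, 0.01]],
]]

variance = equating_variance(grad, cov)
standard_error = math.sqrt(variance)
print(f"Var = {variance:.4f}, SE = {standard_error:.4f}")
```

In an application, `grad` would hold all four tests, so the anchor test contributes twice, once with each group's estimates.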


Practical Application

Without data, it is difficult to make inferences about the magnitude
of the sampling errors in IRT equating. Will they be larger or smaller
than the sampling errors in conventional linear equating? In conventional
equipercentile equating? Do sampling errors become large or small at
extreme score levels?

Equation (10) has been applied to an equating of the Verbal score on
the 90-item Form VSA4 of the Scholastic Aptitude Test (12/73 administration)
to the 85-item Form XSA2 Verbal score (4/75 administration). All examinees
took an SAT and also a 40-item anchor test. Petersen, Cook, and Stocking
(1980) made separate LOGIST runs on the 130 items in the 1973 administration
for a sample of 2665 examinees, and on the 125 items in the 1975
administration for a sample of 2686 examinees. They have allowed the
use here of their item parameter estimates.

SAT scaled scores are a linear transformation of formula scores (rights
minus one-quarter wrongs). Our results here are for the hypothetical
case where all examinees answer all items. In this special case formula
scores are a linear transformation of number-right scores, so scaled
scores are likewise. Since a known linear transformation $A\xi + B$
of number-right scores $\xi$ simply multiplies the standard error of
$\hat\eta$ by the constant $A$, it is not difficult to obtain scaled-score
standard errors from (10). A computer program to do this was written
and run by Marilyn Wingersky.

For each of certain specified formula scores on XSA2, Table 1 shows
1) the equivalent scaled score found by the conventional linear procedure
usually used for the SAT (Design IV A, Angoff, 1971), 2) the standard error
of these equated (scaled) scores as found by the computer program AUTEST
(Lord, 1975) assuming the validity of the linear model; also 3) the
equivalent scaled score found by the IRT method of this report, and 4) the
corresponding scaled-score standard error calculated from (10). The
standard errors in Table 1 are best understood in comparison with the
standard deviation of scaled scores, which is 106 for XSA2; and in
comparison with the classical test theory standard error of measurement
(due to imperfect test reliability), which is 31. Clearly the standard
error of equating is small compared to the standard error of measurement.

Judging by the IRT standard errors, the equating is definitely
nonlinear, at least outside the score range from 350 to 650. The
IRT standard errors show a continued sharp increase as the minimum
possible true formula score of -5.5 is approached. At the other end of the
score scale, the IRT standard error increases up to a scaled score of 760
and decreases thereafter. The reason for the decrease at the upper end is
that for a perfect score, the standard error of this kind of IRT equating
is zero. Except at the upper end, the IRT standard error is larger than
the linear.

                              Table 1

 A Comparison of Linear and IRT Equatings and of Their Standard Errors

Selected              Linear Model                 IRT Model
formula          Equivalent    Standard      Equivalent    Standard
scores*, XSA2    scaled score  error         scaled score  error
-------------------------------------------------------------------
84               780           4.6           813.8         2.3
79.74            750           4.2           778.0         4.5
72.70            700           3.6           717.6         4.4
65.65            650           3.1           658.8         3.6
58.61            600           2.5           602.4         2.8
51.57            550           2.1           548.0         2.2
44.52            500           1.7           495.4         2.0
37.48            450           1.5           445.7         2.1
30.43            400           1.6           399.3         2.3
23.39            350           1.8           355.6         2.8
16.35            300           2.3           313.3         3.6
9.30             250           2.8           270.2         4.7
2.26             200           3.3           223.0         7.0
-5               150           3.9           163.5         15.6

*Although formula score is actually a discrete variable, it
is for convenience treated here as continuous.

The results of Table 1 are displayed in Figures 1-2. The straight
line in Figure 1 shows the linear equating of true formula score on
XSA2 to true scaled score on VSA4. The dashed lines are drawn two
standard errors above and below the straight line.

Figure 2 similarly displays the curvilinear IRT equating of XSA2
to VSA4 and its standard error. The straight-line extension of the lower
end of the equating (middle) line in Figure 2 was obtained by the method
described in Lord (1980, pp. 210-211). It is shown in the figure for
completeness, but no standard error is shown since there is no good
theoretical basis for such an extension.

Table 2 compares present IRT equating with a conventional
equipercentile equating of XSA2 to VSA4 via the anchor test. In
conventional equating, an XSA2 score and a VSA4 score that are each
equipercentile-equivalent to a given anchor test score are taken to be
equivalent to each other. The standard error of the resulting
equipercentile equating of XSA2 to VSA4 is given by
$\sqrt{SE_{XSA2}^2 + SE_{VSA4}^2}$, where the SEs under the radical
sign are standard errors of separate equipercentile equatings of each
test to the anchor test. Formulas for $SE_{XSA2}$ and $SE_{VSA4}$ are
given in Lord (1981).
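The root-sum-of-squares combination just described is a one-line computation. The sketch below is illustrative: the function name and the two input values are hypothetical, and the separate standard errors are assumed to have been obtained elsewhere.

```python
import math

def chained_equipercentile_se(se_xsa2, se_vsa4):
    """Combine the standard errors of the two separate equatings to the
    anchor test, at a given score level, as a root sum of squares."""
    return math.hypot(se_xsa2, se_vsa4)

# Hypothetical standard errors at one score level.
combined = chained_equipercentile_se(3.0, 4.0)
print(combined)  # 5.0
```

Because the two components add in quadrature, the combined standard error is never smaller than either component alone.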


Figure 1. Linear equating of true formula score on XSA2 to true
scaled score on VSA4. Dashed lines are two scaled-score standard errors
above and below equating line.

                              Table 2

          A Comparison of Equipercentile and IRT Equating
                   and of Their Standard Errors

XSA2             Equipercentile Method             IRT Model
formula          Equivalent    Standard      Equivalent    Standard
score            scaled score  error         scaled score  error
-------------------------------------------------------------------
78.1             774           13.47         764           4.68
70.6             722           15.85         700           4.18
64.75            652           10.32         651           3.44
58.9             602           4.97          605           2.78
52.9             558           4.12          558           2.32
47.25            514           3.47          515           2.09
40.1             466           3.44          464           2.05
32.4             417           2.93          412           2.24
25.75            364           3.37          370           2.63
16.1             314           4.07          312           3.62
7.6              242           5.70          259           5.08
-3.75            195           7.85          175           12.49
Since $SE_{XSA2}$ and $SE_{VSA4}$ are estimated from unsmoothed data,
the equipercentile standard errors in Table 2 fluctuate somewhat.
Nevertheless, it is apparent that the equipercentile method has a much
larger standard error above a scaled score of 450. For these data, the
IRT method shows a larger standard error than the equipercentile method
only when the formula score is negative.

The standard error of equipercentile equating could be reduced by
smoothing the frequency distribution of raw scores before equating.
Smoothing is undoubtedly desirable as a practical expedient; however, the
choice of a smoothing formula is somewhat arbitrary and the smoothing is
likely to prevent convergence of the estimated equating to its true value
in large samples. Formulas for the standard errors of smoothed
equipercentile equating are not presently available.

In order to determine the effect of using a shorter anchor test,
every other item in the anchor test was discarded and the data
reanalyzed on the basis of the remaining 20-item anchor test. The
effect on the standard errors of IRT equating is shown in Table 3.
The two equatings agree fairly well. At the point where the equating
standard errors are a minimum, halving the length of the anchor test
increases the standard error by a factor of about $\sqrt{2}$. At the other
score points, the effect is less. Given standard errors like those in
Table 2, it will now be possible to make a reasonable judgment as to the
length necessary for an anchor test.
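The size of the anchor-length effect can be read off by taking the ratio of the two standard-error columns of Table 3. The short sketch below does this with the rounded table entries, so the ratios are themselves only approximate; the variable names are illustrative.

```python
import math

# (XSA2 formula score, SE with 20-item anchor, SE with 40-item anchor),
# copied from Table 3.
table3 = [
    (80, 5.9, 4.5), (70, 5.3, 4.1), (60, 3.9, 2.9),
    (50, 3.0, 2.2), (40, 2.7, 2.0), (30, 3.0, 2.4),
    (20, 3.9, 3.2), (10, 5.4, 4.6), (0, 9.9, 8.4),
]

# Ratio of standard errors after halving the anchor test.
ratios = {score: se20 / se40 for score, se20, se40 in table3}

# Score level at which the 40-item standard error is smallest.
min_score = min(table3, key=lambda row: row[2])[0]

for score, ratio in ratios.items():
    print(f"formula score {score:>2}: ratio {ratio:.2f}")
print(f"ratio at the minimum-SE score ({min_score}): {ratios[min_score]:.2f}")
print(f"sqrt(2) for comparison: {math.sqrt(2):.2f}")
```

The ratio is largest near the middle of the score range and falls toward the extremes, in line with the statement that the effect is smaller away from the minimum.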
                              Table 3

       IRT Equatings and Their Scaled-Score Standard Errors,
     a Comparison of Results Using 20- and 40-Item Anchor Tests

                           Length of Anchor Test
XSA2                  20 Items                    40 Items
formula          Scaled        Standard      Scaled        Standard
score            score         error         score         error
-------------------------------------------------------------------
80               787           5.9           780           4.5
70               698           5.3           695           4.1
60               615           3.9           613           2.9
50               540           3.0           536           2.2
40               467           2.7           463           2.0
30               399           3.0           397           2.4
20               336           3.9           335           3.2
10               274           5.4           275           4.6
0                206           9.9           206           8.4